Playing with distances: Document Similarity
Authors
Abstract
Spoken information retrieval is a promising domain of research. In this paper, we describe our participation in the pilot Document Similarity Amid Automatically Detected Terms task of FIRE 2014. We present the findings of our experiments with variants of distance-based and timestamp-based approaches. The de-normalized distance-based variant outperformed the other two, delivering the best results of the submitted runs. However, there is scope for further improvement in the results.
Similar resources
Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation
Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...
Document clustering based on ontology and a fuzzy approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering is one of the important techniques of text mining: the unsupervised classification of similar documents into different groups. The most important step...
The Quadratic-Chi Histogram Distance Family - Appendices
This document contains the appendices for the paper “The Quadratic-Chi Histogram Distance Family” [1], including proofs and additional results. In Section 2 we prove that all Quadratic-Chi histogram distances are continuous. In Section 3 we prove that EMD, ÊMD and all Quadratic-Chi histogram distances are Similarity-Matrix-Quantization-Invariant. In Section 4 we present additional shape classification res...
Enhancement of Search Results Using Dynamic Document Seed Reranking Algorithm
We proposed an algorithm to improve the precision of top retrieved documents by reordering the retrieved documents in the initial retrieval. To re-order the documents, we first automatically extract key terms and key phrases from top N retrieved documents and generate a document index for each document. Using the standard similarity metrics, a document similarity matrix is generated for these d...
The role of semantic relations in improving the results of a citation recommendation system - Selected paper of the 17th National Conference of the Computer Society of Iran
With the increasing growth of scientific documents on the Web, it is difficult to select a relevant document. A citation recommendation system receives a text and recommends documents to be cited by that text. Such recommendations help a researcher find relevant texts. Based on semantic relations, this paper presents a new indicator to measure the similarity between documents an...